{========================================================================}
{ File Name : 'GetURL.doc', 21-Jan-95 }
{========================================================================}
-- Script to download HTML systems across the network --
{========================================================================}
{ Contents }
{========================================================================}
1. Introduction
2. Installation
3. Use
3.1. Basic Use
3.2. Options
3.3. Help & Setup Options
3.4. Control Options
3.5. Restriction Options
3.6. Input & Output Options
4. I Robot
5. Examples
6. Match
7. New Versions
8. Known Bugs & Problems
9. Glossary
10. Changes
11. Credits
12. Contact
{========================================================================}
{ Introduction }
{========================================================================}
GetURL.rexx is an ARexx script which downloads World-Wide Web pages.
With a simple command line it will download a specific page, and with
more complex command lines it can download whole sets of documents
spanning many pages.
The intention was to create a tool that allowed local caching of important
web pages and a flexible way of specifying which pages are important. The
script has no GUI as yet but may have at some stage in the future.
If you have ever tried to download and save to disc a 200 page document
using Mosaic, then you know what this script is for. Mosaic will only
let you load a page, then save it to disc, then load another page etc.
This is a very frustrating process. GetURL automates it and
will run in batch mode without user intervention.
The major features of GetURL.rexx are as follows:
* doesn't require AMosaic, so you can be browsing something else
with AMosaic whilst this is running
* save pages to your hard disc so that they can be read offline and
you can also give them to friends on a floppy disc. Who knows,
you may even be able to sell discs containing web pages :-)
* flexible set of command line switches that allow you to restrict the
type of pages that it downloads
* ability to specify files for the lists of URLs that it keeps, so
that any search for pages can be stopped and restarted at a later
date, i.e. you could run GetURL for 2 hours a day whilst you are
online and gradually download everything in the entire universe
and it won't repeat itself.
* includes the ability to download itself when there are new versions.
* will use a proxy if you have access to one, in order to both speed up
access to pages and also to reduce network load.
* will download binary files (*.gif, *.lha) as easily as text and html
files.
* documentation is in the top of the script file.
{========================================================================}
{ Installation }
{========================================================================}
Just copy the file GetURL.rexx to your REXX: directory.
You should also add an assign for Mosaic:
e.g.
assign Mosaic: PD:Mosaic/
TimeZone
========
If you want to use the -IfModified flag (which is **VERY** useful)
then you should also configure the TimeZone.
Use a normal text editor, find the line which looks like
gv.timezone = ''
and enter your TimeZone expressed as a difference to Greenwich Mean Time
(England Time) e.g. I am in Melbourne so my TimeZone is GMT+1100
so I put
gv.timezone = '+1100'
If I were in Adelaide I would be in the TimeZone GMT+1030
gv.timezone = '+1030'
Note: Anywhere in the USA is going to be GMT-???? so make sure
you get it right.
- If you are in England then put +0000
- Don't put symbols like EST or whatever, put it numerically.
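Purely as an illustration of where the offset ends up (this is not
GetURL's actual code, and the exact date format it sends is not shown
here), the configured value simply gets tacked onto the end of a
date string, along these lines:
/* rough sketch only - gv.timezone is the variable you edit            */
gv.timezone = '+1100'
stamp = LEFT(DATE('Weekday'), 3)',' DATE('Normal') TIME('Normal') gv.timezone
SAY stamp   /* something like:  Wed, 1 Feb 1995 14:30:00 +1100         */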
Match
=====
Although neither is necessary, GetURL will perform better in the presence
of 'Match' and 'rexxarplib.library'.
Match should be in your shell's search path. The simplest way to do this is
to copy it to your C: directory.
RexxArpLib.library is available somewhere on AmiNet.
{========================================================================}
{ Use }
{========================================================================}
Basic Use
=========
The basic use of GetURL is to download a single page from the World-Wide
Web. This can be achieved by doing
rx geturl http://www.cs.latrobe.edu.au/~burton/
This will download the page specified by the URL into an appropriately
named file. For this example the file will be called
Mosaic:www.cs.latrobe.edu.au/~burton/index.html
The required directories will be created by GetURL if necessary.
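For the curious, the mapping from URL to filename is roughly as follows.
This is only a sketch of the idea, not the code the script actually uses:
/* sketch: turn a URL into a save path under Mosaic:                   */
url = 'http://www.cs.latrobe.edu.au/~burton/'
PARSE VAR url '://' hostpath               /* drop the 'http' part      */
IF RIGHT(hostpath, 1) = '/' THEN           /* directory URLs get a      */
    hostpath = hostpath || 'index.html'    /* default file name         */
savefile = 'Mosaic:' || hostpath
SAY savefile    /* Mosaic:www.cs.latrobe.edu.au/~burton/index.html      */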
Options
=======
GetURL has many command line options which allow you to do much
more interesting things. The following is a discussion of each
option individually. The names of the options can all be abbreviated,
but this may be unwise as you may end up specifying a different option
than you intend.
Help & Setup Options
====================
-Help
e.g.
rx geturl -help
Prints a summary of all the options
-Problem
e.g.
rx geturl -problem
Allows a bug report, gripe or problem to be entered from
the CLI.
-NewVersion <file>
e.g.
rx geturl -newversion t:geturl.rexx
Downloads a new version of GetURL from my university account (assuming
the university link hasn't gone down or something). Don't save the new
copy over the old copy until you know it has been downloaded properly.
-PatternMatching
Downloads 'Match' from my university account. Match allows GetURL to use
pattern matching in the restriction options (see the section on Match below)
-Associative
Uses a different scheme to keep the lists of URL addresses, which is quite
a bit faster.
-Delay
Sets the delay between loading pages, in seconds (defaults to 2 seconds).
Control Options
===============
-Recursive
This causes GetURL to search each downloaded file for URLs and to then
fetch each one of these pages and search those. As you can guess this
will fill up your hard disc downloading every page on the planet (and
several off the planet also). In fact this is what is called a Web Robot
and should be used with due caution. Press control-c when you have had
enough. GetURL will finish only when its list of unvisited URLs is empty.
-NoProxy
Normally GetURL will try to use a proxy server if you have one set up.
You can set up a proxy server by adding something like the following to
your startnet script.
setenv WWW_HTTP_GATEWAY http://www.mira.net.au:80/
setenv WWW_FTP_GATEWAY http://www.mira.net.au:80/
Where 'www.mira.net.au' is replaced by your local proxy host and '80'
is replaced by the port number to which the proxy is connected. Proxies
are normally connected to port 80.
The NoProxy option causes GetURL to talk directly to the host in the URL
rather than asking the proxy to do so. This should only be necessary if
your proxy host isn't working properly at the moment.
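As an aside, environment variables live as plain files in ENV: on the
Amiga, so a script can read the setting very simply. A minimal sketch
(not necessarily how GetURL itself does it):
/* sketch: find out whether a proxy has been configured                */
proxy = ''
IF Open(envfile, 'ENV:WWW_HTTP_GATEWAY', 'Read') THEN DO
    proxy = ReadLn(envfile)
    CALL Close(envfile)
END
IF proxy = '' THEN SAY 'No proxy set - talking to each host directly'
ELSE SAY 'Requests will go through' proxy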
-NoProxyCache
Asks the proxy host to refresh its cache if necessary. Whether the
proxy host takes any notice of this flag or not is its problem.
-Retry
GetURL keeps a list of URLs that it couldn't access. If this flag is set,
when no further URLs are available to visit, GetURL will make another
attempt to fetch each of the pages that failed before.
Restriction Options
===================
The behaviour of each of these options depends on whether you have
'Match' installed or not. (see the section on match below)
Note: patterns are not case sensitive.
-Host <pattern>
Allows you to specify or restrict URLs to certain hosts.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -host #?.au
This will try to download any URLs connected to my home page that
are in the Australian domain '.au'
The pattern following '-host' can be any AmigaDOS filename pattern.
See the DOS Manual for a full description of these patterns but just
quickly here are a few examples
#? - means 'any sequence of characters'
(a|b) - means 'either option a or option b'
? - means 'any single character'
~a - means 'any string that does not match a'
More specifically
-host #?.au - means 'any host name ending in .au'
-host (info.cern.ch|www.w3.org) - means 'either info.cern.ch or www.w3.org'
-Path <pattern>
This works in the same manner as the -Host option except that you are
describing the pathname component of the URL
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -path ~burton/NetBSD/#?
This will try to download the NetBSD documentation from my university
account.
Note: don't start the path with a leading '/' character as a '/' is
automatically inserted between the host and path parts of the URL
e.g.
rx geturl http://ftp.wustl.edu/ -recursive -path pub/aminet/#?
rx geturl http://www.cs.latrobe.edu.au/~burton/ -path #?.(gif|jpg|xbm)
-URL <pattern>
This works in the same manner as the -Host option except that you are
describing the whole URL
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -url http://www.cs.latrobe.edu.au/~burton/#?
-Number <num>
Will only download <num> files across the network (not including the initial
URL supplied on the command line)
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -number 5
This will download my home page and 5 other pages which are mentioned
in it.
-Length <num>
Will only download a file across the network if it is smaller than
<num> bytes
-IfModified
Will only download a file across the network if it is newer than the
file in the cache. If there is no appropriately named file in the cache
directory then the file will be downloaded anyway.
-Depth <num>
Will only download files across the network until a recursion depth of
<num> is reached. The first time a page is analyzed for URLs that page
has depth 0, the URLs found in it have depth 1. The URLs found in pages
of depth 1 have depth 2 and so on. This gradually develops into a tree
of Web pages, with the initial URL as the root (level 0), each URL found
in the root page hanging from the root (level 1), and the contents of their
pages hanging from them. This option allows GetURL to stop at the desired
depth of the tree.
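In other words, each URL carries a depth one greater than the page it was
found in, and anything deeper than <num> is simply not fetched. A very
rough sketch of the bookkeeping (the variable names are invented for
illustration and are not taken from the script):
/* sketch: a queue of 'url depth' pairs with a depth cut-off            */
maxdepth = 2
queue.0 = 1
queue.1 = 'http://www.cs.latrobe.edu.au/~burton/ 0'
DO i = 1 WHILE i <= queue.0
    PARSE VAR queue.i url depth
    IF depth > maxdepth THEN ITERATE
    SAY 'fetching' url 'at depth' depth
    /* each URL found in the fetched page would be appended here with   */
    /* depth+1, e.g. n = queue.0 + 1; queue.n = newurl depth+1; queue.0 = n */
END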
Without Match
=============
The documentation above assumes that Match is installed. If Match
is not installed then AmigaDOS patterns will not work. Instead you
can use the '*' character as a place holder.
e.g.
-host *.cs.latrobe.edu.au
- meaning 'any host in the cs.latrobe.edu.au domain'
-path */index.html
- meaning 'a file index.html in any single level directory'
-url http://www.cs.*.edu/*/index.html
- meaning 'a file called index.html in a single level directory on
any WWW host in any U.S. university'
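If you are curious how such a place-holder match can be done in plain
ARexx, here is a small sketch. It is illustrative only (the routine inside
GetURL may well be different), and the name StarMatch is made up:
/* sketch: case-insensitive matching with '*' as a place holder        */
StarMatch: PROCEDURE
    PARSE ARG pat, str
    IF pat = '' THEN RETURN (str = '')
    IF LEFT(pat, 1) = '*' THEN DO
        /* '*' swallows zero or more characters of the string           */
        IF StarMatch(SUBSTR(pat, 2), str) THEN RETURN 1
        IF str = '' THEN RETURN 0
        RETURN StarMatch(pat, SUBSTR(str, 2))
    END
    IF str = '' THEN RETURN 0
    IF UPPER(LEFT(pat, 1)) ~= UPPER(LEFT(str, 1)) THEN RETURN 0
    RETURN StarMatch(SUBSTR(pat, 2), SUBSTR(str, 2))
With this sketch, StarMatch('*.au', 'www.cs.latrobe.edu.au') gives 1 and
StarMatch('*.au', 'info.cern.ch') gives 0.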
Input & Output Options
======================
-Input <filename>
Usually GetURL will start by downloading a page from the web and
save it. This option allows you to specify a file on your hard
disc to start with.
e.g. assuming t:temp.html contains a list of URLs to search
rx geturl -input t:temp.html -r
This will search recursively starting with the specified file
-Output <filename>
Instead of saving the downloaded files into sensibly named files
under Mosaic: this option allows you to append the material downloaded
into a single file.
Mostly useful for downloading a single file.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -output James.html
-SaveRoot <dir>
If you don't want the downloaded files to be saved under Mosaic:
you can redirect them to another directory with this command
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -saveroot tmp:cache
-Visited <file>
GetURL keeps a list of URLs that have already been visited. This
allows GetURL to stop itself repeating by checking each new URL
against the contents of the list of visited URLs. If you specify a file
(it may or may not exist previously) with this option - that file will
be used to check new URLs and to save URLs that have just been visited.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -visited tmp:Visited.grl
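The idea is simple: before a URL is fetched it is looked up in the file,
and appended if it is new. A rough sketch of that check (the helper name
SeenBefore is invented here, not taken from the script):
/* sketch: has this URL been visited already? if not, record it        */
SeenBefore: PROCEDURE
    PARSE ARG url, listfile
    seen = 0
    IF Open(vf, listfile, 'Read') THEN DO
        DO WHILE ~EOF(vf)
            IF ReadLn(vf) = url THEN DO
                seen = 1
                LEAVE
            END
        END
        CALL Close(vf)
    END
    IF ~seen THEN DO
        IF Open(vf, listfile, 'Append') THEN DO
            CALL WriteLn(vf, url)
            CALL Close(vf)
        END
    END
    RETURN seen
Something like IF ~SeenBefore(url, 'tmp:Visited.grl') would then guard
each fetch.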
-UnVisited <file>
Use the specified unvisited file; the URLs in it are visited first.
Normally GetURL will be used to download a starting file, search it for
URLs and then visit each URL found in turn searching each of those files
for further URLs. Each time GetURL finds a new URL it will append it to
a file so that it can come back later and download that page. This option
can be used to specify a file (it may or may not previously exist) which
will be used for this purpose. If there are already URLs in the specified
file these will be visited before any new URLs which are found.
-Failed <file>
When for some reason GetURL fails to download a file, the URL of that file
is added to a list. This option causes GetURL to use the file specified
(it may or may not exist previously) to store that list.
-SaveHeaders
When a file is retrieved using the HTTP protocol, the file is accompanied
by a header, rather like an email message. This option causes GetURL to
keep the header in a suitably named file.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -saveheaders
will save 2 files
Mosaic:www.cs.latrobe.edu.au/~burton/index.html
Mosaic:www.cs.latrobe.edu.au/~burton/index.html.HDR
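The exact lines depend on the server, but a saved header typically looks
something like this (an illustrative example, not output from a real fetch):
HTTP/1.0 200 OK
Date: Wed, 01 Feb 1995 10:30:00 GMT
Server: NCSA/1.3
Content-type: text/html
Last-modified: Sat, 21 Jan 1995 03:14:00 GMT
Content-length: 2310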
{========================================================================}
{ I Robot }
{========================================================================}
The Problems with Robots
========================
GetURL is considered a 'robot' web client. This is because it can
automatically download an indeterminate number of pages or files over
the network. That is what a robot is. The power of a robot can be
abused very easily, far more easily than it can be used for reasonable
purposes. We all hate what has come to be called 'spamming' by commercial
entities on the 'net', but abuse of a robot is no different. Indiscriminate
robot use can make an entire network utterly useless for the period
during which the program is run (possibly for a great deal longer).
It is possible for a robot program to download hundreds of pages per
minute. In order to help people make the best use of GetURL without
overloading network resources and without annoying system administrators
worldwide, GetURL has some built-in rules for containing the behaviour
of the robot.
Rules for Containing the Robot
==============================
Once the robot has started off exploring the Web, it is important that
it be restrained. There are two sorts of restraints in effect: ones
that you can control via the restriction options (described above), and
some that you cannot control. Because GetURL is supplied as source code,
you are asked not to remove the restraints described here, but to
accept them. The web is a global resource with, as yet, no equivalent
of the police, so in order to leave it usable by other people you
must use self-restraint.
* Database queries
- any URL containing a question mark '?' is disallowed. This means
that GetURL will not make database queries or provide parameters
for CGI scripts. The reason is that the return from a query could,
and very often does, contain URLs for further queries, leading to an
infinite loop of queries. It is simpler to completely disallow any
URL containing a '?' than to distinguish reasonable URLs. (A small
sketch of this check appears after this list.)
* Site exclusion
- not implemented yet
* Proxies
- by default GetURL tries to use a proxy if one is configured.
This means that multiple fetches of the same page, or fetches of
pages that are commonly fetched by other users, are optimised.
* Delay
- by default GetURL waits 2 seconds between fetching files across
the network. This stops GetURL fetching lots of pages in a short
time, making its impact on the network negligible.
* Non-optimal implementation
- GetURL could be implemented in a much more efficient fashion,
both in terms of choice of compiler and choice of algorithm.
In future I would like to redesign GetURL and redevelop it in
C or a similar language, but at the moment the slow implementation
acts as something of a restraint in itself.
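To make the first and fourth rules concrete, here is a tiny sketch of the
kind of test and pause involved. It is illustrative only, not the script's
actual code, and the example URL is made up:
/* sketch: refuse query URLs and pause between fetches                  */
url = 'http://somehost/cgi-bin/search?amiga'     /* an invented example  */
IF POS('?', url) > 0 THEN
    SAY 'Refusing query URL:' url
ELSE DO
    ADDRESS COMMAND 'Wait 2'    /* let the network breathe for 2 seconds */
    /* ...the page would be fetched here...                              */
END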
More about Robots
=================
The best thing to do if you need to know more about Robots is to read
the following page with Mosaic.
http://web.nexor.co.uk/mak/doc/robots/robots.html
<A HREF="http://web.nexor.co.uk/mak/doc/robots/robots.html">World Wide Web Robots, Wanderers, and Spiders</A>
{========================================================================}
{ Examples }
{========================================================================}
** Examples of use of this wonderful program :-) (please send me yours)
1) rx geturl -h
print help message
2) rx geturl http://www.cs.latrobe.edu.au/~burton/ -output t:JamesBurton.html
get James' home page, save to file
3) rx geturl http://info.cern.ch/ -recursive -host info.cern.ch -visited t:visited
fetch all pages on info.cern.ch reachable from http://info.cern.ch/, keep a list
4) rx geturl -directory uunews: -unvisited t:news-urls
search for URLs in all news articles, save to file, but don't visit them yet
5) rx geturl -problem
make a suggestion, ask for help, send bug report
6) rx geturl -NewVersion t:geturl.rexx
download most recent version of the script (not wise to put rexx:geturl.rexx)
7) an ADOS script called 'continue'
.key url
.bra [
.ket ]
rx geturl [url] -visited tmp:geturl/visited -unvisited tmp:geturl/unvisited -recursive -saveroot tmp:geturl/cache -failed t:geturl/failed
;; call it every time you log on, and it will fetch a few more pages. Send it ctrl-c
;; to stop. Shouldn't repeat itself. add -retry to make it attempt to retry
;; URLs it couldn't get before (those in the failed file)
8) rx geturl http://www.cs.latrobe.edu.au/~burton/PigFace-small.gif
download my portrait - just that and nothing more
9) rx geturl -PatternMatching
download Match utility
10) rx geturl http://www.cs.latrobe.edu.au -r -path #?.(gif|jpg|xbm)
download picture files in my home page
{========================================================================}
{ Match }
{========================================================================}
Match is a small utility I wrote in C and compiled with DICE v3.0
The purpose of Match is to make AmigaDOS pattern matching available
from within scripts. I couldn't find a simple way of making ARexx do
this by itself, but I found an incredibly simple way using C. GetURL
does not require Match, but is considerably more useful with Match.
Match works as follows:
Match <pattern> <string>
e.g.
Match #? abcde
yes
Match #?.abc abcde
no
Match #?.abc abcde.abc
yes
Match prints either 'yes' or 'no' depending on whether the string matches
the pattern. Or in other words, if the pattern correctly describes the
string Match will print 'yes', otherwise 'no'.
You can get hold of Match by using the following command line
rx geturl -patternmatching
Match should be installed somewhere in your shell's search path. The simplest
way to do this is to copy Match to your C: directory.
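For illustration, this is roughly how an ARexx script can ask Match a
question and read the answer back. The temporary file name and the helper
name are made up; GetURL's real code may well differ:
/* sketch: call Match and turn its 'yes'/'no' answer into a 1/0 result */
MatchesPattern: PROCEDURE
    PARSE ARG pattern, string
    tmp = 'T:GetURL-Match.tmp'
    ADDRESS COMMAND 'Match >'tmp pattern string
    answer = 'no'
    IF Open(mf, tmp, 'Read') THEN DO
        answer = ReadLn(mf)
        CALL Close(mf)
    END
    ADDRESS COMMAND 'Delete QUIET' tmp
    RETURN (answer = 'yes')
With Match installed, MatchesPattern('#?.au', 'www.cs.latrobe.edu.au')
would then return 1.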
{========================================================================}
{ New Versions }
{========================================================================}
GetURL is still suffering from 'featuritis'. This means that every
day I think of 5 new problems that GetURL could solve if only it had
such and such an option. To make it worse, since I released GetURL onto
the network, other people have been suggesting things that I hadn't
thought of. The point is that GetURL is updated regularly. So I made
GetURL able to download a new version of itself.
If you use the command line
rx geturl -newversion
GetURL will attempt to download the latest version of itself. New versions
may appear as often as once a week, especially if bugs or problems are
found.
Please mail me if you have any bugs, problems or suggestions. The easiest
way to do this is to use the command line
rx geturl -problem
{========================================================================}
{ Known Bugs & Problems }
{========================================================================}
** Problems, ToDos, ideas, ... (please suggest more)
1) http://www.cs.latrobe.edu.au != http://www.cs.latrobe.edu.au/
these count as separate - should be recognised as the same
2) warning if download an empty file
4) check content-length: header (add a nice display)
5) check content-type:
6) FILE://localhost - should change local_prefix temporarily
7) should delete temporary files
- including visited.tmp files
8) when use -output the WHOLE file is searched every time a page is appended
9) would be nice to spawn off HTTP processes
10) need better search engine (grep?) I tried c:search but it's no use
11) make it smaller (takes too long to compile)
12) implement missing protocols, FTP notably
13) implement -directory, -update
14) implement -date <date> (only downloads pages newer than date)
15) convince Mosaic developers to add ability to check Mosaic: cache
before downloading over the network
16) write HTTP cache server for AmiTCP (oh yeah sure!)
17) host/pathname aliases file
18) clean up an existing agenda file, removing things that are in the
prune-file and other matching URLs
19) CALL not used in function call where a value is returned
{========================================================================}
{ Glossary }
{========================================================================}
World-Wide Web (or WWW for short)
For many years computers around the world have been connected together
in what is known as the InterNet. This means that you can have real-time
as-you-speak access to somebody else's computer. The Web uses this facility
to access information on other people's computers (maybe in another country)
and to present this information to you. GetURL is a CLI or shell command
that allows you to browse the Web from your Amiga without human intervention,
i.e. automatically.
URL (Universal Resource Locator - or web-address for medium)
This is quite a complex way of saying what file you want. I won't attempt
to describe it in full but here are a few examples.
Note: GetURL probably won't understand all of these
http://host[:port][/path]
describes a file accessible via the HTTP protocol
ftp://host[:port][/path]
describes a file accessible via the FTP protocol
telnet://host[:port]
describes a telnet login
gopher://host
describes a file or directory accessible via Gopher
mailto:user@host
describes an email address
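A URL is easy to take apart with ARexx's PARSE instruction; this fragment
is illustrative only, not code taken from the script:
/* sketch: split a URL into its scheme, host, port and path parts      */
url = 'http://www.cs.latrobe.edu.au:80/~burton/index.html'
PARSE VAR url scheme '://' hostport '/' path
PARSE VAR hostport host ':' port
IF port = '' THEN port = 80          /* 80 is the usual HTTP port       */
SAY scheme host port '/'path
/* prints:  http www.cs.latrobe.edu.au 80 /~burton/index.html           */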
Proxy
Because millions of people are now using the Web, the network is
suffering overload symptoms. Normally HTTP specifies that pages
are downloaded directly from the place in which they are stored,
but if you connect to a proxy server, then when you go to download
a page your client (maybe GetURL or AMosaic) will ask the proxy
host for the page. The proxy host either already has the file
in its cache (in which case that saves loading it again across the network
to the proxy) or it goes and fetches it normally. Either way you end up
getting the file, and on average quite a bit quicker.
AmiNet
A group of computers which each have a set of programs for Amiga computers.
Hundreds of thousands of Amiga users download files from AmiNet every day.
Patterns
A way of describing something. Computer programs use patterns to decide
whether something is of interest.
from a shell try
list #?.info
This will list all the files in the current directory that match the
pattern '#?.info'. See the AmigaDOS manual for more information.
Case Sensitive
This pertains to patterns. If a pattern is case sensitive then a capital
'A' will only match with another 'A' and not 'a'. If a pattern is not case
sensitive (or case insensitive) then capitals and lower-case are considered
the same letter, so 'A' matches with 'a'. Patterns in GetURL are
case-insensitive because that's how the Match program works.
{========================================================================}
{ Changes }
{========================================================================}
History of changes
version 0.9 : initial beta release 08-Jan-95
version 1.0 : 15-Jan-95
- fixed problem in header parsing - now stops at lines containing cr 09-Jan-95
- fixed problem with name of header_file being 'HEADER_FILE' sometimes 09-Jan-95
- fixed : initial URL always added to visited file without check 09-Jan-95
- now will still run -recursive if no initial URL 09-Jan-95
- added -saveroot 09-Jan-95
- implemented FILE method to localhost 09-Jan-95
- added -number 10-Jan-95
- added -newversion 15-Jan-95
- added -failed 15-Jan-95
- added -retry 15-Jan-95
- added top of file data for HTML files only 15-Jan-95
- fixed -output 15-Jan-95
version 1.01 : 16-Jan-95
- fixed openlibrary problem 16-Jan-95 Brian Thompstone <brian@clrlight.demon.co.uk>
- fixed relative URLs problem 16-Jan-95 Brian Thompstone <brian@clrlight.demon.co.uk>
- fixed output file mode problem 16-Jan-95
version 1.02 : 26-Jan-95 (Australia Day)
- added X-Mailer header 16-Jan-95
- added AmigaDOS regexp in -host, -path 16-Jan-95
- added Match downloader 17-Jan-95
- fixed HREF & SRC missed because capitals 19-Jan-95
- fixed mailto: ending up as a relative 19-Jan-95
- wrote GetURL.doc properly 22-Jan-95
- added GetURL.doc to -NewVersion 22-Jan-95
- changed global vars to gv.* style 26-Jan-95
- added -LENGTH (thanks again Brian) 26-Jan-95
- added -SaveHeaders (thanks again Brian) 26-Jan-95
- added -IfModified 26-Jan-95
- changed problem address back to Latcs1 26-Jan-95
- added associative 26-Jan-95
- added skip of the top of HTML files so that you don't always
end up searching my home page :-) 26-Jan-95
version 1.03 : 01-Feb-95
- added -DELAY 29-Jan-95
- now refuses to work on any URLs containing a '?' 29-Jan-95
- only SRC & HREF links accepted (was accepting NAME) 29-Jan-95
- added configuration for TimeZone 29-Jan-95
- Finally, items are only added to agenda if not already there 29-Jan-95
- same for pruned, failed
- fixed a few problems with -DEPTH and implemented it for
non-array agenda 30-Jan-95
- added robot section to documentation 31-Jan-95
{========================================================================}
{ Credits }
{========================================================================}
GetURL.rexx was originally written by James Burton after a discussion with
Michael Witbrock of AMosaic fame. GetURL is public domain. Thanks to Brian
Thompstone for
(i) actually using it
(ii) actually telling me what was wrong with it
(iii) writing some fixes and additions
(iv) enthusiasm
James Burton <burton@cs.latrobe.edu.au>
Brian Thompstone <brian@clrlight.demon.co.uk>
Michael J. Witbrock <mjw@PORSCHE.BOLTZ.CS.CMU.EDU>
{========================================================================}
{ Contact }
{========================================================================}
James Burton
c/o
Department of Computer Science & Computer Engineering
Latrobe University
Bundoora, Victoria, 3083
Australia
EMail: burton@cs.latrobe.edu.au
Web: http://www.cs.latrobe.edu.au/~burton/
{========================================================================}
{ End of File 'GetURL.doc' }
{========================================================================}